315 research outputs found

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
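
    The feed-to-HTML matching described in this abstract can be illustrated with a minimal sketch. It is not the paper's model: the path notation, the exact-string match, and the names TextPathCollector and derive_rules are assumptions made for illustration, and the "score" is simply the fraction of posts in which a path matched the feed value.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped so the path stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextPathCollector(HTMLParser):
    """Collects (path, text) pairs; a path is the chain of tag[.class] names."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def derive_rules(posts):
    """posts: list of (feed_values, html). Returns {property: (path, score)}.

    Hypothetical scoring: score = share of posts where the path held the feed value.
    """
    votes = defaultdict(Counter)
    for feed_values, html in posts:
        collector = TextPathCollector()
        collector.feed(html)
        collector.close()
        for prop, value in feed_values.items():
            # A path gets at most one vote per post, even if it matches several times.
            matched = {p for p, text in collector.pairs if text == value.strip()}
            for path in matched:
                votes[prop][path] += 1
    rules = {}
    for prop, counter in votes.items():
        path, hits = counter.most_common(1)[0]
        rules[prop] = (path, hits / len(posts))
    return rules
```

    Voting across several posts is what lets an accidental match (for example, a title repeated in the body of one post) lose out to the element that carries the value consistently.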

    Information Extraction in Illicit Domains

    Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment. Comment: 10 pages, ACM WWW 201

    Intelligent Self-Repairable Web Wrappers

    The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust algorithms of Web data mining which could automatically face possible malfunctioning or failures. On the other, in the literature there is a lack of solutions for the maintenance of these systems. Procedures that extract Web data may be strictly interconnected with the structure of the data source itself; thus, malfunctioning or acquisition of corrupted data could be caused, for example, by structural modifications of data sources brought by their owners. Nowadays, verification of data integrity and maintenance are mostly manually managed, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources -- the so-called Web wrappers -- which can face possible malfunctioning caused by modifications of the structure of the data source, and can automatically repair themselves.

    An Automated Algorithm for Extracting Website Skeleton

    The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, as most of the systems for generating wrappers focus on extracting data at page-level, data extraction at site-level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website skeleton, i.e. extracting the underlying hyperlink structure that is used to organize the content pages in a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages in the next level in the website structure. The entire skeleton is then constructed by recursively fetching pages pointed to by the discovered links and analyzing these pages using the same process. Our experiments on real-life websites show that the algorithm achieves a high recall with moderate precision.
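
    The recursive construction described in this abstract can be sketched as follows. This is not the Sew algorithm itself: the abstract does not state how navigation links are identified, so pick_navigation_group uses a hypothetical stand-in heuristic (the largest group of not-yet-seen links), and fetch_link_groups is an assumed callback that returns a page's hyperlinks already partitioned into groups.

```python
def pick_navigation_group(link_groups, visited):
    """Hypothetical stand-in: choose the largest group of links not seen so far."""
    candidates = [[u for u in group if u not in visited] for group in link_groups]
    return max(candidates, key=len, default=[])


def build_skeleton(url, fetch_link_groups, visited=None, max_depth=5):
    """Return a nested dict {"url": ..., "children": [...]} rooted at url."""
    visited = set() if visited is None else visited
    visited.add(url)
    node = {"url": url, "children": []}
    if max_depth == 0:
        return node
    nav_links = pick_navigation_group(fetch_link_groups(url), visited)
    # Mark all siblings before recursing so one sibling cannot claim another as its child.
    visited.update(nav_links)
    for child_url in nav_links:
        node["children"].append(
            build_skeleton(child_url, fetch_link_groups, visited, max_depth - 1)
        )
    return node


# Toy in-memory site: each page maps to its groups of hyperlinks.
SITE = {
    "/": [["/news", "/sports"], ["/about"]],
    "/news": [["/", "/news", "/sports"], ["/news/a", "/news/b"]],
    "/sports": [["/", "/news", "/sports"], ["/sports/x"]],
}

skeleton = build_skeleton("/", lambda url: SITE.get(url, []))
```

    The shared visited set and the depth limit keep the recursion from looping on menus that link back to higher levels; a real crawler would replace the lambda with a function that downloads and parses each page.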

    Electronic and optical properties of electromigrated molecular junctions

    Electromigrated nanoscale junctions have proven very useful for studying electronic transport at the single-molecule scale. However, confirming that conduction is through precisely the molecule of interest and not some contaminant or metal nanoparticle has remained a persistent challenge, typically requiring a statistical analysis of many devices. We review how transport mechanisms in both purely electronic and optical measurements can be used to infer information about the nanoscale junction configuration. The electronic response to optical excitation is particularly revealing. We briefly discuss surface-enhanced Raman spectroscopy on such junctions, and present new results showing that currents due to optical rectification can provide a means of estimating the local electric field at the junction due to illumination. Comment: 19 pages, 8 figures, invited paper for forthcoming special issue of Journal of Physics: Condensed Matter. For other related papers, see http://www.ruf.rice.edu/~natelson/publications.htm

    Large-Scale Atomistic Simulations of Environmental Effects on the Formation and Properties of Molecular Junctions

    Using an updated simulation tool, we examine molecular junctions comprised of benzene-1,4-dithiolate bonded between gold nanotips, focusing on the importance of environmental factors and inter-electrode distance on the formation and structure of bridged molecules. We investigate the complex relationship between monolayer density and tip separation, finding that the formation of multi-molecule junctions is favored at low monolayer density, while single-molecule junctions are favored at high density. We demonstrate that tip geometry and monolayer interactions, two factors that are often neglected in simulation, affect the bonding geometry and tilt angle of bridged molecules. We further show that the structures of bridged molecules at 298 and 77 K are similar. Comment: To appear in ACS Nano, 30 pages, 5 figures

    Logic, Probability and Action: A Situation Calculus Perspective

    The unification of logic and probability is a long-standing concern in AI, and more generally, in the philosophy of science. In essence, logic provides an easy way to specify properties that must hold in every possible world, and probability allows us to further quantify the weight and ratio of the worlds that must satisfy a property. To that end, numerous developments have been undertaken, culminating in proposals such as probabilistic relational models. While this progress has been notable, a general-purpose first-order knowledge representation language to reason about probabilities and dynamics, including in continuous settings, is still to emerge. In this paper, we survey recent results pertaining to the integration of logic, probability and actions in the situation calculus, which is arguably one of the oldest and most well-known formalisms. We then explore reduction theorems and programming interfaces for the language. These results are motivated in the context of cognitive robotics (as envisioned by Reiter and his colleagues) for the sake of concreteness. Overall, the advantage of proving results for such a general language is that it becomes possible to adapt them to any special-purpose fragment, including but not limited to popular probabilistic relational models.

    Developmental profile of localized spontaneous Ca2+ release events in the dendrites of rat hippocampal pyramidal neurons

    Author Posting. © The Author(s), 2012. This is the author's version of the work. It is posted here by permission of Elsevier B.V. for personal use, not for redistribution. The definitive version was published in Cell Calcium 52 (2012): 422-432, doi:10.1016/j.ceca.2012.08.001. Recent experiments demonstrate that localized spontaneous Ca2+ release events can be detected in the dendrites of pyramidal cells in the hippocampus and other neurons (J. Neurosci. 29:7833-7845, 2009). These events have some properties that resemble ryanodine receptor mediated “sparks” in myocytes, and some that resemble IP3 receptor mediated “puffs” in oocytes. They can be detected in the dendrites of rats of all tested ages between P3 and P80 (with sparser sampling in older rats), suggesting that they serve a general signaling function and are not just important in development. However, in younger rats the amplitudes of the events are larger than the amplitudes in older animals and almost as large as the amplitudes of Ca2+ signals from backpropagating action potentials (bAPs). The rise time of the event signal is fast at all ages and is comparable to the rise time of the bAP fluorescence signal at the same dendritic location. The decay time is slower in younger animals, primarily because of weaker Ca2+ extrusion mechanisms at that age. Diffusion away from a brief localized source is the major determinant of decay at all ages. A simple computational model closely simulates these events with extrusion rate the only age-dependent variable. Supported in part by NIH grant NS-016295

    Green function techniques in the treatment of quantum transport at the molecular scale

    The theoretical investigation of charge (and spin) transport at nanometer length scales requires the use of advanced and powerful techniques able to deal with the dynamical properties of the relevant physical systems, to explicitly include out-of-equilibrium situations typical for electrical/heat transport as well as to take into account interaction effects in a systematic way. Equilibrium Green function techniques and their extension to non-equilibrium situations via the Keldysh formalism form one of the pillars of current state-of-the-art approaches to quantum transport, which have been implemented in both model Hamiltonian formulations and first-principles methodologies. We offer a tutorial overview of the applications of Green functions to deal with some fundamental aspects of charge transport at the nanoscale, mainly focusing on applications to model Hamiltonian formulations. Comment: Tutorial review, LaTeX, 129 pages, 41 figures, 300 references, submitted to Springer series "Lecture Notes in Physics"